The “Fashion” in Economic Research

The world of academic research, much like fashion, has distinct trends. Methodologies and topics rise and fall in popularity over time.

This project was inspired by a chart in The Economist that visualized this very phenomenon in 2016.

Goal

  • Can I replicate this analysis?
  • Can I build a tool to explore any term, not just pre-selected ones?
  • Can I automate the process so the data is always current?

Step 1: The Automated Data Pipeline

The foundation of the project is an automated pipeline that runs weekly using GitHub Actions. It ensures the app’s data is always up-to-date.

Part 1: Ingestion & Cleaning

  1. A scheduled job downloads the latest metadata from the NBER website.

  2. It checks against a local SQLite database and appends only new papers.

  3. It cleans the data, ensuring dates are valid and text fields are standardized.

Part 2: Processing & Preparation

  1. A second script reads the clean data from the SQLite database.

  2. It combines titles and abstracts into a single, searchable text field.

  3. The final, enriched dataset is saved as a Parquet file, ready for the app.


NBER-METHODS-ETL/
├── .github/
│   └── update_data.yml
├── app/
│   └── app.py
├── data/
│   ├── nber_papers.db
│   └── nber_full_text.parquet
├── pipeline/
│   ├── 01_ingest_data.py
│   └── 02_process_text.py
├── requirements.txt
└── README.md

Step 2: Exploring the Data (EDA)

Before building the app, I explored the data to find interesting patterns.

Top Terms (TF-IDF)

I used TF-IDF to find the most characteristic terms for each year. This reveals the “buzzwords” of the time.

Year Top 5 Characteristic Terms
2020 covid, pandemic, lockdown, social distancing, stimulus
2010 financial crisis, stimulus, mortgage, banks, foreclosure
2000 technology, internet, productivity, stock returns, 90s
1990 soviet union, trade liberalization, wage inequality, fdi, nafta

Step 3: The Interactive App

The final result is a Streamlit web application that brings all this data to life.

Try It Yourself!

nber-methods-etl.streamlit.app

Conclusion & Future Work

Conclusion

  • It is possible to build a fully automated, publicly available tool for exploring research trends.

  • The interactive nature of Streamlit allows for much deeper and more personalized exploration than a static chart.

Future Work

  • Performance: Implement a pre-computed inverted index for near-instant search results.

  • New Features:

    • Author-level analysis.
    • Shareable URLs to save and send specific views.
  • Advanced Analysis:

    • Use word embeddings for semantic search.

Code Explained!

The Goal: 01_ingest_data.py

This script is the first and most critical step in our automated data pipeline.

Its primary job is to:

  1. Download the latest paper metadata from the NBER website.

  2. Compare it against our local SQLite database.

  3. Append only the new papers, ensuring our data is always up-to-date without duplication.

Configuration and Setup

A well-structured script starts with clear configuration. We define constants for our data source and local database paths. Using pathlib makes path manipulation clean and OS-agnostic.

import pandas as pd
import sqlite3
from pathlib import Path
import csv

# --- Configuration ---
# Define the base URL for NBER data
BASE_URL = "http://data.nber.org/nber_paper_chapter_metadata/tsv/"

# Define project structure paths
PROJECT_ROOT = Path(__file__).resolve().parents[1]
DB_PATH = PROJECT_ROOT / "data" / "03_primary" / "nber_papers.db"
TABLE_NAME = "papers"

Downloading the Raw Data

Next, we fetch the data directly from the NBER’s TSV (Tab-Separated Values) files using pandas.

A try...except block makes our script resilient to network errors or changes in the source URL.

def ingest_and_update_db():
    """..."""
    try:
        print("Downloading core reference data (ref.tsv)...")
        latest_df = pd.read_csv(
            f"{BASE_URL}ref.tsv",
            sep='\t',
            names=['paper', 'author', 'title', 'issue_date', 'doi'],
            header=0,
            engine='python',
            quoting=csv.QUOTE_NONE
        )
        print(f"Successfully loaded {latest_df.shape[0]} total records from ref.tsv.")

Note: The quoting=csv.QUOTE_NONE is important because the source TSV files have unquoted special characters that can break the default parser.

Merging with Abstracts

The metadata and abstracts are in separate files. We download the abstracts and use pd.merge to join them into a single, unified DataFrame. A left join ensures we keep all papers, even if an abstract is missing.

        print("Downloading abstracts (abs.tsv)...")
        abstracts_df = pd.read_csv(
            f"{BASE_URL}abs.tsv",
            sep='\t',
            names=['paper', 'abstract'],
            header=0,
            engine='python',
            quoting=csv.QUOTE_NONE
        )
        latest_df = pd.merge(latest_df, abstracts_df, on="paper", how="left")
        print("Successfully merged abstracts.")

Data Cleaning

Real-world data is often messy. The issue_date column can contain invalid entries like ‘0000-00-00’.

We handle this gracefully:

  1. Use pd.to_datetime with errors='coerce' to turn any invalid date formats into NaT (Not a Time).
  2. Use .dropna() to remove these rows, ensuring high data quality downstream.
    # --- MORE Data Cleaning ---
    print("Cleaning 'issue_date' column...")

    # Convert to datetime, coercing errors will turn invalid dates like '0000-00-00' into NaT
    latest_df['issue_date'] = pd.to_datetime(latest_df['issue_date'], errors='coerce')
    
    # Drop rows where the date conversion resulted in NaT (Not a Time) using reassignment
    latest_df = latest_df.dropna(subset=['issue_date'])

Checking the Existing Database

Before we add new data, we need to know what we already have. We connect to the SQLite database and read all existing paper IDs into a Python set for highly efficient lookups.

    # 3. Connect to SQLite DB and get existing paper IDs
    existing_ids = set()
    try:
        conn = sqlite3.connect(DB_PATH)
        if pd.io.sql.table_exists(TABLE_NAME, conn):
            print(f"Database found at {DB_PATH}. Fetching existing paper IDs...")
            existing_ids_df = pd.read_sql(f"SELECT paper FROM {TABLE_NAME}", conn)
            existing_ids = set(existing_ids_df['paper'])
            print(f"Found {len(existing_ids)} existing papers in the database.")
        else:
            print("No existing database found. A new one will be created.")
        conn.close()
    except Exception as e:
        # ...

Identifying New Papers

This is the key step for preventing duplicates. We use the power of pandas boolean indexing and the isin() method. This one line of code efficiently filters our downloaded DataFrame, keeping only the rows whose ‘paper’ ID is not in our existing_ids set.

    new_papers_df = latest_df[~latest_df['paper'].isin(existing_ids)]

Appending New Data to the Database

If any new papers were found, we connect to the database again and append them. Using to_sql with if_exists='append' is crucial. It tells pandas to add the new rows to the existing table instead of creating a new one or erroring out.

    if not new_papers_df.empty:
        print(f"Found {len(new_papers_df)} new papers to add.")
        try:
            conn = sqlite3.connect(DB_PATH)
            new_papers_df.to_sql(
                TABLE_NAME,
                conn,
                if_exists='append',
                index=False
            )
            conn.close()
            print("Successfully appended new papers to the database.")

Executing the Script

Finally, the if __name__ == '__main__': block ensures the ingest_and_update_db function runs only when the script is executed directly. This is standard Python practice that allows functions from this script to be safely imported into other files without side effects.

if __name__ == '__main__':
    newly_added_papers = ingest_and_update_db()
    if not newly_added_papers.empty:
        print("\n--- Summary of Newly Added Papers ---")
        print(newly_added_papers[['paper', 'title', 'issue_date']].head())
    print("\n--- Pipeline execution finished. ---")

This completes the automated, incremental update of our NBER database!

Github Actions Workflow